Twitter's massive user base of 330 million monthly active users presents a direct avenue for businesses to connect with a broad audience. However, the vast amount of information on the platform makes it challenging for brands to swiftly detect negative social mentions that may impact their reputation. That's why sentiment analysis/classification, which involves monitoring emotions in conversations on social media platforms, has become a key strategy in social media marketing, enabling businesses to understand customer sentiments and gain insights to stay ahead in their industry.
The aim of this project is to build a sentiment analysis model that classifies tweets from US airline customers as positive, neutral, or negative.
We will employ a systematic approach to develop and choose the best model for this Twitter sentiment analysis task.
We will use Bag of Words (BoW) with a Random Forest classifier, TF-IDF with a Random Forest classifier, a Keras Tokenizer with a Long Short-Term Memory (LSTM) network, and GloVe embeddings with an LSTM, and then choose the best model.
Our approach will involve the following key steps:
◎ Bag of Words (BoW): Transform the text data into a matrix of word frequencies.
◎ TF-IDF (Term Frequency-Inverse Document Frequency): Represent the text data using TF-IDF scores to give importance to rare words.
◎ Keras Tokenizer: Convert tweets into padded integer sequences to feed an LSTM network.
◎ Word Embeddings - GloVe: Capture semantic relationships between words in dense vector representations.
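The difference between the first two representations can be sketched on a toy corpus; the three example "tweets" below are made up for illustration and are not from the dataset:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "late flight again",
    "great flight",
    "late bag late flight",
]

# Bag of Words: raw term counts per document
bow_vec = CountVectorizer()
bow = bow_vec.fit_transform(docs).toarray()

# TF-IDF: counts re-weighted so corpus-wide common words matter less
tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(docs).toarray()

# BoW keeps raw counts: "late" appears twice in the third document.
# TF-IDF down-weights "flight" (present in every document) relative to
# rarer words such as "late".
```

The same trade-off drives the two Random Forest experiments below: BoW preserves raw frequency information, while TF-IDF emphasizes discriminative words.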
# install and import necessary libraries.
import numpy as np # Import numpy.
import pandas as pd # Import pandas.
import matplotlib.pyplot as plt # Import Matplotlib
# To help with data visualization
%matplotlib inline
import seaborn as sns # Import seaborn
sns.set(
color_codes=True
) # -----This adds a background color to all the plots created using seaborn
# Allow the use of Display via interactive Python
from IPython.display import display
# Import library for exploratory visualization of missing data.
import missingno as ms
from sklearn.feature_extraction.text import CountVectorizer # Import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer # Import TfidfVectorizer
from sklearn.model_selection import train_test_split # Import train test split
from sklearn.ensemble import RandomForestClassifier # Import Random Forest Classifier
from sklearn.model_selection import cross_val_score # Import cross val score
from sklearn.metrics import confusion_matrix # Import confusion matrix
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder # To convert target variable to numeric
from tensorflow.keras.metrics import Precision, Recall
import re, string, unicodedata # Import Regex, string and unicodedata.
import contractions # Import contractions library.
from bs4 import BeautifulSoup # Import BeautifulSoup.
import nltk # Import Natural Language Tool-Kit.
from nltk.corpus import stopwords # Import stopwords.
from nltk.tokenize import word_tokenize, sent_tokenize # Import Tokenizer.
from nltk.stem.wordnet import WordNetLemmatizer # Import Lemmatizer.
# nltk.download("omw-1.4") # Package omw-1.4 is already up-to-date!
# nltk.download("stopwords") # Package stopwords is already up-to-date!
# nltk.download("punkt") # Package punkt is already up-to-date!
# nltk.download("wordnet") # Package wordnet is already up-to-date!
from wordcloud import WordCloud, STOPWORDS # Import WordCloud and STOPWORDS
from tensorflow.keras.preprocessing.text import Tokenizer # Import Keras Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences # Import padding
# Sequential: This allows us to create a linear stack of layers for building neural networks
from tensorflow.keras.models import Sequential
# Import the different layers we will use in our sequential neural network
from tensorflow.keras.layers import (
Embedding,
Bidirectional,
LSTM,
Dense,
Dropout,
SpatialDropout1D,
)
from tensorflow.keras.optimizers import Adam # Importing the Adam optimizer algorithm
# Import the different callbacks to be used
from tensorflow.keras.callbacks import (
LearningRateScheduler,
History,
EarlyStopping,
ModelCheckpoint,
)
# To suppress warnings
import warnings
warnings.filterwarnings("ignore")
# Making the Python code more structured automatically
%reload_ext nb_black
# Define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
data = pd.read_csv("Tweets.csv") # Code to read the dataset
# Making a copy of the data to avoid any changes to original data
df = data.copy()
print("Loading Dataset... Done.")
Loading Dataset... Done.
# Checking the top 5, bottom 5 and 5 random rows
display(df.head()) # -----looking at head (top 5 observations)
display(df.tail()) # -----looking at tail (bottom 5 observations)
display(
df.sample(5, random_state=1)
) # -----5 random sample of observations from the data
| tweet_id | airline_sentiment | airline_sentiment_confidence | negativereason | negativereason_confidence | airline | airline_sentiment_gold | name | negativereason_gold | retweet_count | text | tweet_coord | tweet_created | tweet_location | user_timezone | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 570306133677760513 | neutral | 1.0000 | NaN | NaN | Virgin America | NaN | cairdin | NaN | 0 | @VirginAmerica What @dhepburn said. | NaN | 2015-02-24 11:35:52 -0800 | NaN | Eastern Time (US & Canada) |
| 1 | 570301130888122368 | positive | 0.3486 | NaN | 0.0000 | Virgin America | NaN | jnardino | NaN | 0 | @VirginAmerica plus you've added commercials t... | NaN | 2015-02-24 11:15:59 -0800 | NaN | Pacific Time (US & Canada) |
| 2 | 570301083672813571 | neutral | 0.6837 | NaN | NaN | Virgin America | NaN | yvonnalynn | NaN | 0 | @VirginAmerica I didn't today... Must mean I n... | NaN | 2015-02-24 11:15:48 -0800 | Lets Play | Central Time (US & Canada) |
| 3 | 570301031407624196 | negative | 1.0000 | Bad Flight | 0.7033 | Virgin America | NaN | jnardino | NaN | 0 | @VirginAmerica it's really aggressive to blast... | NaN | 2015-02-24 11:15:36 -0800 | NaN | Pacific Time (US & Canada) |
| 4 | 570300817074462722 | negative | 1.0000 | Can't Tell | 1.0000 | Virgin America | NaN | jnardino | NaN | 0 | @VirginAmerica and it's a really big bad thing... | NaN | 2015-02-24 11:14:45 -0800 | NaN | Pacific Time (US & Canada) |
| tweet_id | airline_sentiment | airline_sentiment_confidence | negativereason | negativereason_confidence | airline | airline_sentiment_gold | name | negativereason_gold | retweet_count | text | tweet_coord | tweet_created | tweet_location | user_timezone | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 14635 | 569587686496825344 | positive | 0.3487 | NaN | 0.0000 | American | NaN | KristenReenders | NaN | 0 | @AmericanAir thank you we got on a different f... | NaN | 2015-02-22 12:01:01 -0800 | NaN | NaN |
| 14636 | 569587371693355008 | negative | 1.0000 | Customer Service Issue | 1.0000 | American | NaN | itsropes | NaN | 0 | @AmericanAir leaving over 20 minutes Late Flig... | NaN | 2015-02-22 11:59:46 -0800 | Texas | NaN |
| 14637 | 569587242672398336 | neutral | 1.0000 | NaN | NaN | American | NaN | sanyabun | NaN | 0 | @AmericanAir Please bring American Airlines to... | NaN | 2015-02-22 11:59:15 -0800 | Nigeria,lagos | NaN |
| 14638 | 569587188687634433 | negative | 1.0000 | Customer Service Issue | 0.6659 | American | NaN | SraJackson | NaN | 0 | @AmericanAir you have my money, you change my ... | NaN | 2015-02-22 11:59:02 -0800 | New Jersey | Eastern Time (US & Canada) |
| 14639 | 569587140490866689 | neutral | 0.6771 | NaN | 0.0000 | American | NaN | daviddtwu | NaN | 0 | @AmericanAir we have 8 ppl so we need 2 know h... | NaN | 2015-02-22 11:58:51 -0800 | dallas, TX | NaN |
| tweet_id | airline_sentiment | airline_sentiment_confidence | negativereason | negativereason_confidence | airline | airline_sentiment_gold | name | negativereason_gold | retweet_count | text | tweet_coord | tweet_created | tweet_location | user_timezone | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8515 | 568198336651649027 | positive | 1.0000 | NaN | NaN | Delta | NaN | GenuineJack | NaN | 0 | @JetBlue I'll pass along the advice. You guys ... | NaN | 2015-02-18 16:00:14 -0800 | Massachusetts | Central Time (US & Canada) |
| 3439 | 568438094652956673 | negative | 0.7036 | Lost Luggage | 0.7036 | United | NaN | vina_love | NaN | 0 | @united I sent you a dm with my file reference... | NaN | 2015-02-19 07:52:57 -0800 | ny | Quito |
| 6439 | 567858373527470080 | positive | 1.0000 | NaN | NaN | Southwest | NaN | Capt_Smirk | NaN | 0 | @SouthwestAir Black History Commercial is real... | NaN | 2015-02-17 17:29:21 -0800 | La Florida | Eastern Time (US & Canada) |
| 5112 | 569336871853170688 | negative | 1.0000 | Late Flight | 1.0000 | Southwest | NaN | scoobydoo9749 | NaN | 0 | @SouthwestAir why am I still in Baltimore?! @d... | [39.1848041, -76.6787131] | 2015-02-21 19:24:22 -0800 | Tallahassee, FL | America/Chicago |
| 5645 | 568839199773732864 | positive | 0.6832 | NaN | NaN | Southwest | NaN | laurafall | NaN | 0 | @SouthwestAir SEA to DEN. South Sound Volleyba... | NaN | 2015-02-20 10:26:48 -0800 | NaN | Pacific Time (US & Canada) |
Observations
From the first & last few rows and the sample rows, the dataset has been loaded properly
airline_sentiment is our target variable and it will be converted to numerical digits.
We will drop columns like tweet_id, name... as they will add no value to our models.
df.shape # Code to get the shape of data
(14640, 15)
Observations
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14640 entries, 0 to 14639
Data columns (total 15 columns):
 #   Column                        Non-Null Count  Dtype
---  ------                        --------------  -----
 0   tweet_id                      14640 non-null  int64
 1   airline_sentiment             14640 non-null  object
 2   airline_sentiment_confidence  14640 non-null  float64
 3   negativereason                9178 non-null   object
 4   negativereason_confidence     10522 non-null  float64
 5   airline                       14640 non-null  object
 6   airline_sentiment_gold        40 non-null     object
 7   name                          14640 non-null  object
 8   negativereason_gold           32 non-null     object
 9   retweet_count                 14640 non-null  int64
 10  text                          14640 non-null  object
 11  tweet_coord                   1019 non-null   object
 12  tweet_created                 14640 non-null  object
 13  tweet_location                9907 non-null   object
 14  user_timezone                 9820 non-null   object
dtypes: float64(2), int64(2), object(11)
memory usage: 1.7+ MB
Observations
# Few Select Categorical Columns
cat_cols = ["airline", "airline_sentiment", "airline_sentiment_gold", "negativereason"]
# Check the unique values of select categorical variables
for i in cat_cols:
print("Unique values % in", i, "are :")
print(df[i].value_counts(normalize=True) * 100)
print("*" * 50)
print("\n")
Unique values % in airline are :
United            26.106557
US Airways        19.897541
American          18.845628
Southwest         16.530055
Delta             15.177596
Virgin America     3.442623
Name: airline, dtype: float64
**************************************************

Unique values % in airline_sentiment are :
negative    62.691257
neutral     21.168033
positive    16.140710
Name: airline_sentiment, dtype: float64
**************************************************

Unique values % in airline_sentiment_gold are :
negative    80.0
positive    12.5
neutral      7.5
Name: airline_sentiment_gold, dtype: float64
**************************************************

Unique values % in negativereason are :
Customer Service Issue         31.706254
Late Flight                    18.141207
Can't Tell                     12.965788
Cancelled Flight                9.228590
Lost Luggage                    7.888429
Bad Flight                      6.319460
Flight Booking Problems         5.763783
Flight Attendant Complaints     5.240793
longlines                       1.939420
Damaged Luggage                 0.806276
Name: negativereason, dtype: float64
**************************************************
Observations
airline_sentiment & airline_sentiment_gold are mostly negative, with airline_sentiment at about 63% negative and airline_sentiment_gold at 80%.
The most common negativereason is Customer Service Issue at about 32%, followed by Late Flight at about 18%.
# Checking missing values across each column
c_missing = pd.Series(df.isnull().sum(), name="Missing Count") # -----Count Missing
p_missing = pd.Series(
round(df.isnull().sum() / df.shape[0] * 100, 2), name="% Missing"
) # -----Percentage Missing
# Combine the Count and Percentage into 1 Dataframe
missing_df = pd.concat([c_missing, p_missing], axis=1)
missing_df.sort_values(by="% Missing", ascending=False).style.background_gradient(
cmap="YlOrRd"
)
| Missing Count | % Missing | |
|---|---|---|
| negativereason_gold | 14608 | 99.780000 |
| airline_sentiment_gold | 14600 | 99.730000 |
| tweet_coord | 13621 | 93.040000 |
| negativereason | 5462 | 37.310000 |
| user_timezone | 4820 | 32.920000 |
| tweet_location | 4733 | 32.330000 |
| negativereason_confidence | 4118 | 28.130000 |
| tweet_id | 0 | 0.000000 |
| airline_sentiment | 0 | 0.000000 |
| airline_sentiment_confidence | 0 | 0.000000 |
| airline | 0 | 0.000000 |
| name | 0 | 0.000000 |
| retweet_count | 0 | 0.000000 |
| text | 0 | 0.000000 |
| tweet_created | 0 | 0.000000 |
# Visual Exploration of Missing Values
# Plot missing values across each columns
plt.title("Missing Values Graph", fontsize=20)
ms.bar(df)
<Axes: title={'center': 'Missing Values Graph'}>
Observations
negativereason_gold and airline_sentiment_gold are missing 99.78% & 99.73% with only 40 and 32 entries respectively. Too many missing entries. These columns will be deleted.
negativereason, user_timezone, tweet_location and negativereason_confidence are missing over 28%. These columns will also be deleted because they are not crucial to our sentiment analysis, also imputing the missing values may lead to misleading conclusions.
Our two most important features, airline_sentiment and text, have no missing values.
It is unfortunate that we have to drop airline_sentiment_gold because that is the ground-truth label.
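The drop-by-missingness rule described above can also be expressed as a threshold filter. A minimal sketch on a hypothetical mini-DataFrame (standing in for df, where negativereason_gold, airline_sentiment_gold, tweet_coord, etc. would be filtered out):

```python
import pandas as pd

# Hypothetical toy frame; "gold_label" mimics a mostly-missing column
toy = pd.DataFrame(
    {
        "text": ["a", "b", "c", "d"],
        "gold_label": [None, None, None, "negative"],  # 75% missing
        "airline": ["United", "Delta", "United", "Delta"],
    }
)

threshold = 0.5  # drop any column with more than 50% missing values
keep = toy.columns[toy.isnull().mean() <= threshold]
toy = toy[keep]  # "gold_label" is dropped; "text" and "airline" survive
```

This keeps the dropping rule explicit and reproducible rather than hand-listing columns.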
# Checking for duplicate records
df.duplicated().sum()
36
# Remove duplicate rows based on all columns
df = df.drop_duplicates()
# Checking for duplicate records
df.duplicated().sum()
0
Observations
# -----
# User defined function to plot labeled_barplot
# -----
def labeled_barplot(data, feature, perc=False, v_ticks=True, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
v_ticks: whether to rotate the x-tick labels vertically (default is True)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 1, 5))
else:
plt.figure(figsize=(n + 1, 5))
if v_ticks is True:
plt.xticks(rotation=90)
ax = sns.countplot(
data=data,
x=feature,
palette="Paired",
order=data[feature].value_counts().index[:n].sort_values(),
)
for p in ax.patches:
if perc == True:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
x = p.get_x() + p.get_width() / 2 # width of the plot
y = p.get_height() # height of the plot
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
# Code to plot the Percentage of Tweets for each Airline
labeled_barplot(
df, "airline", perc=True
) # Code to plot the labeled barplot for airline
Observations
# Code to plot the Distribution of Sentiments across all the Tweets
labeled_barplot(
df, "airline_sentiment", perc=True
) # Code to plot the labeled barplot for airline_sentiment
Observations
# Code to show the plot of all the Negative Reasons
labeled_barplot(
df, "negativereason", perc=True
) # Code to plot the labeled barplot for negative reason
Observations
Customer Service Issue is the most common negativereason, followed by Late Flight. longlines and Damaged Luggage are not as significant.
# Code to plot the barplot for the distribution of each airline with total sentiments
airline_sentiment = (
df.groupby(["airline", "airline_sentiment"]).airline_sentiment.count().unstack()
)
airline_sentiment.plot(kind="bar")
<Axes: xlabel='airline'>
Observations
# Code to show the Distribution of Retweet of Sentiments for each Airline
colors = {"positive": "green", "negative": "red", "neutral": "blue"}
for sentiment in df["airline_sentiment"].unique():
subset = df[df["airline_sentiment"] == sentiment]
plt.bar(
subset["airline"],
subset["retweet_count"],
label=sentiment,
color=colors[sentiment],
)
plt.title("Airline Sentiments by Retweets by Airline")
plt.xlabel("Airline")
plt.ylabel("Retweet Count")
plt.legend(title="Sentiment", loc="upper right")
plt.xticks(rotation=90)
plt.show()
Observations
#####
# Helper function to create and display Wordcloud
#####
def show_wordcloud(data, title):
words = " ".join(data["text"])
cleaned_word = " ".join(
[
word
for word in words.split()
if "http" not in word and not word.startswith("@") and word != "RT"
]
)
# Create a WordCloud object
wordcloud = WordCloud(
stopwords=STOPWORDS, background_color="black", width=3000, height=2500
).generate(cleaned_word)
plt.figure(figsize=(14, 11), frameon=True)
plt.imshow(wordcloud)
plt.axis("off")
plt.title(title, fontsize=30)
plt.show()
# Code to display Wordcloud for Negative Tweets
df_negative = df[df["airline_sentiment"] == "negative"]
show_wordcloud(data=df_negative, title="Negative Tweets")
Observations
# Code to display Wordcloud for positive tweets
df_positive = df[df["airline_sentiment"] == "positive"]
show_wordcloud(data=df_positive, title="Positive Tweets")
Observations
# Code to display Wordcloud for neutral tweets
df_neutral = df[df["airline_sentiment"] == "neutral"]
show_wordcloud(data=df_neutral, title="Neutral Tweets")
Observations
# Extract text and airline sentiment columns from the data
model_df = df[["airline_sentiment", "text"]] # Code to get a subset of data
model_df.head() # Code to display the first 5 rows of the dataset
| airline_sentiment | text | |
|---|---|---|
| 0 | neutral | @VirginAmerica What @dhepburn said. |
| 1 | positive | @VirginAmerica plus you've added commercials t... |
| 2 | neutral | @VirginAmerica I didn't today... Must mean I n... |
| 3 | negative | @VirginAmerica it's really aggressive to blast... |
| 4 | negative | @VirginAmerica and it's a really big bad thing... |
model_df.shape # Code to get the shape of the data
(14604, 2)
model_df[
"airline_sentiment"
].value_counts() # Code to display the count of each sentiment class
negative    9159
neutral     3091
positive    2354
Name: airline_sentiment, dtype: int64
model_df[
"airline_sentiment"
].unique() # Code to display the values in airline sentiment column
array(['neutral', 'positive', 'negative'], dtype=object)
Observations
airline_sentiment column has 3 unique values: 'neutral', 'positive', 'negative'.
# Code to remove the html tags
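Since airline_sentiment will eventually be converted to numeric labels (LabelEncoder is imported above for this), a minimal sketch of the mapping it produces; note that sklearn sorts the classes alphabetically:

```python
from sklearn.preprocessing import LabelEncoder

# negative -> 0, neutral -> 1, positive -> 2 (alphabetical order)
le = LabelEncoder()
codes = le.fit_transform(["neutral", "positive", "negative"])
```

Keeping this ordering in mind matters later when labeling confusion matrices and interpreting model outputs.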
def strip_html(text):
soup = BeautifulSoup(text, "html.parser")
return soup.get_text()
model_df["text"] = model_df["text"].apply(
strip_html
) # Code to apply strip html function on text column
model_df.head() # Code to display the head of the data
| airline_sentiment | text | |
|---|---|---|
| 0 | neutral | @VirginAmerica What @dhepburn said. |
| 1 | positive | @VirginAmerica plus you've added commercials t... |
| 2 | neutral | @VirginAmerica I didn't today... Must mean I n... |
| 3 | negative | @VirginAmerica it's really aggressive to blast... |
| 4 | negative | @VirginAmerica and it's a really big bad thing... |
Observations
def replace_contractions(text):
"""Replace contractions in string of text"""
return contractions.fix(text)
model_df["text"] = model_df["text"].apply(
replace_contractions
) # Code to apply replace contractions function on text column
model_df.head() # Code to display the head of the data
| airline_sentiment | text | |
|---|---|---|
| 0 | neutral | @VirginAmerica What @dhepburn said. |
| 1 | positive | @VirginAmerica plus you have added commercials... |
| 2 | neutral | @VirginAmerica I did not today... Must mean I ... |
| 3 | negative | @VirginAmerica it is really aggressive to blas... |
| 4 | negative | @VirginAmerica and it is a really big bad thin... |
Observations
def remove_numbers(text):
text = re.sub(r"\d+", "", text) # Code to remove numbers
return text
model_df["text"] = model_df["text"].apply(
remove_numbers
) # Code to apply remove numbers function on text column
model_df.head() # Code to display the head of the data
| airline_sentiment | text | |
|---|---|---|
| 0 | neutral | @VirginAmerica What @dhepburn said. |
| 1 | positive | @VirginAmerica plus you have added commercials... |
| 2 | neutral | @VirginAmerica I did not today... Must mean I ... |
| 3 | negative | @VirginAmerica it is really aggressive to blas... |
| 4 | negative | @VirginAmerica and it is a really big bad thin... |
Observations
data.apply(lambda row: nltk.word_tokenize(row["text"]), axis=1)
0 [@, VirginAmerica, What, @, dhepburn, said, .]
1 [@, VirginAmerica, plus, you, 've, added, comm...
2 [@, VirginAmerica, I, did, n't, today, ..., Mu...
3 [@, VirginAmerica, it, 's, really, aggressive,...
4 [@, VirginAmerica, and, it, 's, a, really, big...
...
14635 [@, AmericanAir, thank, you, we, got, on, a, d...
14636 [@, AmericanAir, leaving, over, 20, minutes, L...
14637 [@, AmericanAir, Please, bring, American, Airl...
14638 [@, AmericanAir, you, have, my, money, ,, you,...
14639 [@, AmericanAir, we, have, 8, ppl, so, we, nee...
Length: 14640, dtype: object
# Code to apply tokenization on text column
model_df["text"] = model_df.apply(lambda row: nltk.word_tokenize(row["text"]), axis=1)
# Code to display the head of the data
model_df.head()
| airline_sentiment | text | |
|---|---|---|
| 0 | neutral | [@, VirginAmerica, What, @, dhepburn, said, .] |
| 1 | positive | [@, VirginAmerica, plus, you, have, added, com... |
| 2 | neutral | [@, VirginAmerica, I, did, not, today, ..., Mu... |
| 3 | negative | [@, VirginAmerica, it, is, really, aggressive,... |
| 4 | negative | [@, VirginAmerica, and, it, is, a, really, big... |
Observations
# We want to keep stop-words like "not", "couldn't" etc. because these words matter in sentiment analysis,
# so we will be removing them from the original stopwords list.
stopwords = stopwords.words("english")
customlist = [
"not",
"couldn't",
"didn",
"didn't",
"doesn",
"doesn't",
"hadn",
"hadn't",
"hasn",
"hasn't",
"haven",
"haven't",
"isn",
"isn't",
"ma",
"mightn",
"mightn't",
"mustn",
"mustn't",
"needn",
"needn't",
"shan",
"shan't",
"shouldn",
"shouldn't",
"wasn",
"wasn't",
"weren",
"weren't",
"won",
"won't",
"wouldn",
"wouldn't",
]
# Removing our custom list from stopwords
stopwords = list(set(stopwords) - set(customlist))
lemmatizer = WordNetLemmatizer() # Instantiating the WordNetLemmatizer
#####
# Defining preprocessing helper functions
#####
def remove_non_ascii(words):
"""Remove non-ASCII characters from list of tokenized words"""
new_words = []
for word in words:
new_word = (
unicodedata.normalize("NFKD", word)
.encode("ascii", "ignore")
.decode("utf-8", "ignore")
)
new_words.append(new_word)
return new_words
def to_lowercase(words):
"""Convert all characters to lowercase from list of tokenized words"""
new_words = []
for word in words:
new_word = word.lower()
new_words.append(new_word)
return new_words
def remove_punctuation(words):
"""Remove punctuation from list of tokenized words"""
new_words = []
for word in words:
new_word = re.sub(r"[^\w\s]", "", word)
if new_word != "":
new_words.append(new_word)
return new_words
def remove_stopwords(words):
"""Remove stop words from list of tokenized words"""
new_words = []
for word in words:
if word not in stopwords:
new_words.append(word)
return new_words
def lemmatize_list(words):
new_words = []
for word in words:
new_words.append(lemmatizer.lemmatize(word, pos="v"))
return new_words
def normalize(words):
words = remove_non_ascii(words)
words = to_lowercase(words)
words = remove_punctuation(words)
words = remove_stopwords(words)
words = lemmatize_list(words)
return " ".join(words)
# Applying all the preprocessing functions on our corpus
model_df["text"] = model_df.apply(lambda row: normalize(row["text"]), axis=1)
model_df.head()
| airline_sentiment | text | |
|---|---|---|
| 0 | neutral | virginamerica dhepburn say |
| 1 | positive | virginamerica plus add commercials experience ... |
| 2 | neutral | virginamerica not today must mean need take an... |
| 3 | negative | virginamerica really aggressive blast obnoxiou... |
| 4 | negative | virginamerica really big bad thing |
Observations
The text column (text) has been normalized.
#####
# Code to create a dataframe to store accuracy scores of each model
#####
# Create an empty DataFrame with the specified columns
columns = [
"Random Forest with BoW",
"Random Forest with TF-IDF",
"LSTM with Keras Tokenizer",
"LSTM with GloVe embedding",
]
accuracy_df = pd.DataFrame(columns=columns)
# Define lists where the train & test accuracy scores for each model will be stored before transferring to the dataframe.
accuracy_scores_train = []
accuracy_scores_test = []
# Vectorization (Convert our corpus (text data) to numbers).
# Code to initialize the CountVectorizer function with max_features = 5000.
bow_vec = CountVectorizer(max_features=5000)
# Code to fit and transform the vectorizer on the text column
bow_features = bow_vec.fit_transform(model_df["text"])
# Code to convert the sparse matrix into an array
bow_features = bow_features.toarray()
bow_features.shape # Code to check the shape of the data features
(14604, 5000)
X = bow_features # Code to get the independent variable stored as X
y = model_df["airline_sentiment"] # Code to get the dependent variable stored as Y
# Split data into training and testing set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
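Given the roughly 63/21/16 class imbalance, a stratified split is worth considering as an alternative to the plain split above, so the class mix is identical in train and test. A sketch on hypothetical labels mimicking that imbalance:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 70% negative, 20% neutral, 10% positive
y_demo = np.array(["negative"] * 70 + ["neutral"] * 20 + ["positive"] * 10)
X_demo = np.zeros((100, 1))  # dummy features

# stratify=y_demo preserves the class proportions in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1, stratify=y_demo
)
```

With test_size=0.3 the 30-row test set then contains exactly 21 negative, 6 neutral, and 3 positive samples.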
# Using Random Forest to build model for the classification of reviews.
rf_bow = RandomForestClassifier(
n_estimators=10, n_jobs=4
) # Initialize the Random Forest Classifier
rf_bow = rf_bow.fit(X_train, y_train) # Fit the rf model on X_train and y_train
print(rf_bow)
print(
np.mean(cross_val_score(rf_bow, X_train, y_train, cv=10))
) # Calculate cross validation score
RandomForestClassifier(n_estimators=10, n_jobs=4)
0.7574832664757543
Observations
# Finding optimal number of base learners using k-fold CV ->
base_ln = [x for x in range(1, 25)]
# K-Fold Cross-Validation
cv_scores = [] # Initializing an empty list to store the scores
for b in base_ln:
clf_bow = RandomForestClassifier(
n_estimators=b
) # Code to apply the Random Forest Classifier
scores = cross_val_score(
clf_bow, X_train, y_train, cv=10, scoring="accuracy"
) # Code to find the cross-validation score on the classifier (clf) for accuracy
cv_scores.append(scores.mean()) # Append the scores to cv_scores list
# plot the error as k increases
error = [1 - x for x in cv_scores] # Error corresponds to each number of estimator
optimal_learners = base_ln[
error.index(min(error))
] # Selection of optimal number of n_estimator corresponds to minimum error.
plt.plot(
base_ln, error
) # Plot between each number of estimator and misclassification error
xy = (optimal_learners, min(error))
plt.annotate("(%s, %s)" % xy, xy=xy, textcoords="data")
plt.xlabel("Number of base learners")
plt.ylabel("Misclassification Error")
plt.show()
# Train the best model and calculate accuracy on test data.
clf_bow = RandomForestClassifier(
n_estimators=optimal_learners
) # Initialize the Random Forest classifier with optimal learners
clf_bow.fit(X_train, y_train) # Fit the classifer on X_train and y_train
bow_trainscore = clf_bow.score(
X_train, y_train
) # Find the score on X_train and y_train
bow_testscore = clf_bow.score(X_test, y_test) # Find the score on X_test and y_test
accuracy_scores_train.append(bow_trainscore)
accuracy_scores_test.append(bow_testscore)
print("Train Score: ", accuracy_scores_train)
print("Test Score: ", accuracy_scores_test)
Train Score:  [0.9917824300528273]
Test Score:  [0.7535371976266545]
Observations
# Predict the result for test data using the model built above.
result = clf_bow.predict(
X_test
) # Code to predict the X_test data using the model built above (forest)
# Print and plot confusion matrix
conf_mat = confusion_matrix(
y_test, result
) # Code to calculate the confusion matrix between test data and result
print(conf_mat) # Print confusion matrix
[[2479  191   71]
 [ 426  439   84]
 [ 180  128  384]]
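sklearn orders the confusion-matrix rows and columns by sorted label (here: negative, neutral, positive), so the heatmap axes must use that same order. A tiny hypothetical example showing why passing labels= explicitly keeps the axes unambiguous:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration
y_true = ["negative", "neutral", "positive", "negative"]
y_pred = ["negative", "negative", "positive", "neutral"]

# Row i / column j correspond to labels[i] / labels[j]
labels = ["negative", "neutral", "positive"]
cm = confusion_matrix(y_true, y_pred, labels=labels)
```

Mislabeling the heatmap axes (e.g. listing positive first) silently attributes counts to the wrong classes, so the label order should always be stated alongside the matrix.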
# Plot the confusion matrix
df_cm = pd.DataFrame(
conf_mat,
index=[i for i in ["negative", "neutral", "positive"]],
columns=[i for i in ["negative", "neutral", "positive"]],
)
plt.figure(figsize=(10, 7))
sns.heatmap(df_cm, annot=True, fmt="g")
<Axes: >
# Generate the classification report
report = classification_report(y_test, result)
# Print the classification report
print(report)
precision recall f1-score support
negative 0.80 0.90 0.85 2741
neutral 0.58 0.46 0.51 949
positive 0.71 0.55 0.62 692
accuracy 0.75 4382
macro avg 0.70 0.64 0.66 4382
weighted avg 0.74 0.75 0.74 4382
Observations
all_features = (
bow_vec.get_feature_names_out()
) # Instantiate the feature from the vectorizer
top_features = "" # Top 40 features will be concatenated into top_features after training the model
feat = clf_bow.feature_importances_
features = np.argsort(feat)[::-1]
for i in features[0:40]:
top_features += all_features[i]
top_features += ","
print(top_features)
print(" ")
print(" ")
# Apply wordcloud on the top features
wordcloud = WordCloud(background_color="black", width=3000, height=2500).generate(
top_features
)
thank,not,great,jetblue,usairways,delay,http,flight,southwestair,unite,hours,hold,awesome,love,get,cancel,bag,americanair,wait,virginamerica,best,call,amaze,hour,dm,time,lose,service,please,customer,help,appreciate,go,hrs,make,need,follow,plane,never,still,
plt.figure(figsize=(14, 11), frameon=True)
plt.imshow(wordcloud)
plt.axis("off")
plt.title("Top 40 features WordCloud", fontsize=30)
plt.show()
Observations
airlines column...considering that jetblue is not a part of our analysis. Further analysis needed.
# Using TfidfVectorizer to convert text data to numbers.
tfidf_vect = TfidfVectorizer(max_features=5000) # Code to initialize the TF-IDF vectorizer with max_features = 5000.
tfidf_features = tfidf_vect.fit_transform(model_df["text"]) # Fit_transform the TF-IDF vectorizer on the preprocessed text column
tfidf_features = tfidf_features.toarray() # Code to convert the sparse matrix into an array
tfidf_features.shape # Code to check the shape of the data features
(14604, 5000)
X = tfidf_features # Code to get the independent variable (tfidf_features) stored as X
y = model_df["airline_sentiment"] # Code to get the dependent variable (airline_sentiment) stored as y
# Split data into training and testing set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
# Using Random Forest to build model for the classification of reviews.
rf_tfidf = RandomForestClassifier(
n_estimators=10, n_jobs=4
) # Initialize the Random Forest Classifier
rf_tfidf = rf_tfidf.fit(
X_train, y_train
) # Fit the forest variable on X_train and y_train
print(rf_tfidf)
print(
np.mean(cross_val_score(rf_tfidf, X_train, y_train, cv=10))
) # Calculate cross validation score
RandomForestClassifier(n_estimators=10, n_jobs=4)
0.7307772484756098
Observations
# Finding optimal number of base learners using k-fold CV ->
base_ln = [x for x in range(1, 25)]
# K-Fold Cross-Validation
cv_scores = [] # Initializing an empty list to store the scores
for b in base_ln:
clf_tfidf = RandomForestClassifier(
n_estimators=b
) # Code to apply the Random Forest Classifier
scores = cross_val_score(
clf_tfidf, X_train, y_train, cv=10, scoring="accuracy"
) # Code to find the cross-validation score on the classifier (clf) for accuracy
cv_scores.append(scores.mean()) # Append the scores to cv_scores list
# Plot the misclassification error for each of estimators
error = [1 - x for x in cv_scores] # Error corresponds to each number of estimator
optimal_learners = base_ln[
error.index(min(error))
] # Select the number of estimators with the minimum error
plt.plot(
base_ln, error
) # Plot between each number of estimator and misclassification error
xy = (optimal_learners, min(error))
plt.annotate("(%s, %s)" % xy, xy=xy, textcoords="data")
plt.xlabel("Number of base learners")
plt.ylabel("Misclassification Error")
plt.show()
# Train the best model and calculate accuracy on the test data.
clf_tfidf = RandomForestClassifier(
n_estimators=optimal_learners
) # Initialize the Random Forest classifier with optimal learners
clf_tfidf.fit(X_train, y_train) # Fit the classifier on X_train and y_train
tfidf_trainscore = clf_tfidf.score(
X_train, y_train
) # Find the score on X_train and y_train
tfidf_testscore = clf_tfidf.score(X_test, y_test) # Find the score on X_test and y_test
accuracy_scores_train.append(tfidf_trainscore)
accuracy_scores_test.append(tfidf_testscore)
print("Train Score: ", accuracy_scores_train)
print("Test Score: ", accuracy_scores_test)
Train Score: [0.9917824300528273, 0.9942427790788446]
Test Score: [0.7535371976266545, 0.7470400728597449]
Observations
# Predict the result for test data using the model built above.
result = clf_tfidf.predict(
X_test
) # Predict on X_test using the model built above (clf_tfidf)
# Plot the confusion matrix
conf_mat = confusion_matrix(
y_test, result
) # Calculate the confusion matrix between the test labels and the predictions
df_cm = pd.DataFrame(
conf_mat,
index=["negative", "neutral", "positive"],
columns=["negative", "neutral", "positive"],
) # sklearn orders the classes alphabetically, so the labels must match that order
plt.figure(figsize=(10, 7))
sns.heatmap(
df_cm, annot=True, fmt="g"
) # Plot the heatmap of the confusion matrix
# Generate the classification report
report = classification_report(y_test, result)
# Print the classification report
print(report)
precision recall f1-score support
negative 0.75 0.96 0.84 2741
neutral 0.66 0.36 0.46 936
positive 0.82 0.45 0.58 715
accuracy 0.75 4392
macro avg 0.74 0.59 0.63 4392
weighted avg 0.74 0.75 0.72 4392
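The per-class numbers in the report come straight from the confusion matrix: precision divides each diagonal entry by its column (predicted) total, recall by its row (true) total. A minimal sketch with an illustrative 3x3 matrix (the values below are made up for demonstration, not the exact matrix above):

```python
import numpy as np

labels = ["negative", "neutral", "positive"]
# Hypothetical confusion matrix: rows = true label, columns = predicted label
cm = np.array([
    [2630,  70,  41],   # true negative
    [ 500, 340,  96],   # true neutral
    [ 300,  93, 322],   # true positive
])

for i, name in enumerate(labels):
    tp = cm[i, i]
    precision = tp / cm[:, i].sum()   # diagonal over predicted-column total
    recall = tp / cm[i, :].sum()      # diagonal over true-row total
    f1 = 2 * precision * recall / (precision + recall)
    print(f"{name}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

This makes the pattern in the report easy to read: the model over-predicts the majority "negative" class, which inflates negative recall while depressing neutral and positive recall.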
Observations
all_features = (
tfidf_vect.get_feature_names_out()
) # Get the feature names from the vectorizer
top_features = (
"" # Accumulate the top 40 features after training the model
)
feat = clf_tfidf.feature_importances_
features = np.argsort(feat)[::-1] # Sort feature indices by importance, descending
for i in features[0:40]:
top_features += all_features[i]
top_features += ", "
print(top_features)
print(" ")
print(" ")
# Apply WordCloud to the top features
wordcloud = WordCloud(background_color="black", width=2000, height=1500).generate(
top_features
)
thank, thanks, southwestair, jetblue, usairways, americanair, you, to, united, great, http, the, on, flight, not, no, co, for, and, is, virginamerica, hold, awesome, your, love, my, can, it, in, cancelled, dm, of, but, from, that, have, amazing, will, delayed, be,
# Display the generated image:
plt.figure(1, figsize=(14, 11), frameon=True)
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.title("Top 40 features WordCloud", fontsize=30)
plt.show()
Observations
It's interesting to see "jetblue" among the dominant words, considering that jetblue is not a part of our analysis of the airlines column. Further analysis needed.
vocab_size = 5000 # Vocabulary size
oov_token = "<OOV>" # Out-of-vocabulary token: placeholder for OOV words
max_len = 50 # Maximum length for padding
def tokenize_pad_sequences(text):
'''
This function tokenizes the input text into sequences of integers and then
pads each sequence to the same length
'''
# Text tokenization
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_token)
tokenizer.fit_on_texts(text)
# Transforms text to a sequence of integers
X = tokenizer.texts_to_sequences(text)
# Pad sequences to the same length
X = pad_sequences(X, padding='post', maxlen=max_len)
# return sequences
return X, tokenizer
X, tokenizer = tokenize_pad_sequences(
model_df["text"]
) # Vectorize, Pad and Sequence the corpus
print(
"Before Tokenization & Padding \n", model_df["text"][22]
) # Print a sample document before tokenization & padding
print(
"\nAfter Tokenization & Padding \n", X[22]
) # Print a sample document after tokenization & padding
Before Tokenization & Padding
virginamerica love hipster innovation feel good brand
After Tokenization & Padding
[ 36 70 4820 2339 311 80 1021 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0]
# Get the Word Index: the mapping of the words to numbers.
word_index = tokenizer.word_index
# Convert the dictionary to a list of key-value pairs to display first 40 elements
first_40_elements = list(word_index.items())
first_40_elements[:40] # show first 40
[('<OOV>', 1),
('flight', 2),
('unite', 3),
('not', 4),
('usairways', 5),
('americanair', 6),
('southwestair', 7),
('jetblue', 8),
('get', 9),
('thank', 10),
('http', 11),
('cancel', 12),
('service', 13),
('delay', 14),
('time', 15),
('help', 16),
('go', 17),
('fly', 18),
('call', 19),
('bag', 20),
('wait', 21),
('customer', 22),
('us', 23),
('would', 24),
('hold', 25),
('make', 26),
('need', 27),
('hours', 28),
('plane', 29),
('try', 30),
('still', 31),
('please', 32),
('one', 33),
('gate', 34),
('back', 35),
('virginamerica', 36),
('seat', 37),
('take', 38),
('say', 39),
('flightled', 40)]
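What texts_to_sequences and pad_sequences do with this word index can be reproduced in a few lines of plain Python. This is a simplified sketch with a tiny hand-built vocabulary (the toy_* names are stand-ins, not the real tokenizer state above), and note that Keras's pad_sequences truncates from the front by default, unlike this sketch:

```python
# A tiny stand-in word index (1 = OOV), separate from the real tokenizer above
toy_word_index = {"<OOV>": 1, "flight": 2, "not": 3, "great": 4}
toy_index_word = {i: w for w, i in toy_word_index.items()}
toy_max_len = 8

def to_sequence(text):
    """Map each word to its index, falling back to the OOV index."""
    return [toy_word_index.get(w, toy_word_index["<OOV>"]) for w in text.split()]

def pad_post(seq, maxlen, value=0):
    """'post' padding: append zeros up to maxlen, then truncate to maxlen."""
    return (seq + [value] * maxlen)[:maxlen]

seq = pad_post(to_sequence("flight not great today"), toy_max_len)
print(seq)  # -> [2, 3, 4, 1, 0, 0, 0, 0]  ("today" maps to the OOV index 1)

# The inverse mapping recovers the words, with 0s treated as padding
decoded = [toy_index_word[i] for i in seq if i != 0]
print(decoded)  # -> ['flight', 'not', 'great', '<OOV>']
```

This is why index 0 never appears in the word index: it is reserved for padding, while index 1 absorbs every word outside the top vocab_size words.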
y = pd.get_dummies(model_df["airline_sentiment"])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
print("Train Set ->", X_train.shape, y_train.shape)
print("Test Set ->", X_test.shape, y_test.shape)
Train Set -> (11683, 50) (11683, 3) Test Set -> (2921, 50) (2921, 3)
#####
# Code to build a Long-Short Term memory (LSTM) Neural Network
# This model will be used on the vectors generated by the Keras Tokenizer
#####
embedding_size = 32
rnn_model = Sequential()
# Adding Embedding layer with vocab_size, embedding vectors of embedding_size, and input size of the train data
rnn_model.add(Embedding(vocab_size, embedding_size, input_length=max_len))
# Adding SpatialDropout1D with ratio of 0.2
rnn_model.add(SpatialDropout1D(0.2))
# Adding Bidirectional LSTM layer
rnn_model.add(Bidirectional(LSTM(128)))
# Adding a dropout ratio of 0.4
rnn_model.add(Dropout(0.4))
# Adding a Dense layer
rnn_model.add(Dense(256, activation='relu'))
# Adding a dropout ratio of 0.4
rnn_model.add(Dropout(0.4))
# Adding output layer with 3 units with softmax as activation function
rnn_model.add(Dense(3, activation="softmax"))
rnn_model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (None, 50, 32) 160000
spatial_dropout1d (SpatialDropout1D) (None, 50, 32) 0
bidirectional (Bidirectional) (None, 256) 164864
dropout (Dropout) (None, 256) 0
dense (Dense) (None, 256) 65792
dropout_1 (Dropout) (None, 256) 0
dense_1 (Dense) (None, 3) 771
=================================================================
Total params: 391,427
Trainable params: 391,427
Non-trainable params: 0
_________________________________________________________________
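The parameter counts in the summary can be verified by hand: the Embedding layer holds vocab_size x embedding_size weights, and each LSTM direction has 4 gates, each with input weights, recurrent weights, and a bias. A quick arithmetic check (variable names local to this sketch, using the same sizes as the model above):

```python
vocab_size, embedding_size, lstm_units = 5000, 32, 128
dense_units, n_classes = 256, 3

embedding_params = vocab_size * embedding_size  # one vector per vocab entry
# One LSTM direction: 4 gates, each (input + recurrent) weights plus a bias
lstm_one_direction = 4 * ((embedding_size + lstm_units) * lstm_units + lstm_units)
bilstm_params = 2 * lstm_one_direction  # forward + backward
dense_params = (2 * lstm_units) * dense_units + dense_units  # BiLSTM output is 256-wide
output_params = dense_units * n_classes + n_classes

total = embedding_params + bilstm_params + dense_params + output_params
print(embedding_params, bilstm_params, dense_params, output_params, total)
# -> 160000 164864 65792 771 391427, matching model.summary()
```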
# Compile model
adam = Adam(learning_rate=0.001)
rnn_model.compile(
loss="categorical_crossentropy",
optimizer=adam,
metrics=["accuracy"],
)
# Setup callbacks
callbacks = [
EarlyStopping(
monitor="loss", mode="min", verbose=1, patience=10
), # stop the training process once the model stops improving.
ModelCheckpoint(
filepath="model_weights.h5", save_best_only=True, monitor="loss", mode="min"
), # for saving the best model during training
]
# Train model
epochs = 50
history = rnn_model.fit(
X_train,
y_train,
validation_split=0.2,
epochs=epochs,
verbose=1,
callbacks=callbacks,
)
Epoch 1/50 293/293 [==============================] - 21s 59ms/step - loss: 0.7022 - accuracy: 0.7037 - val_loss: 0.5808 - val_accuracy: 0.7578
Epoch 2/50 293/293 [==============================] - 17s 59ms/step - loss: 0.5004 - accuracy: 0.8037 - val_loss: 0.5302 - val_accuracy: 0.7826
Epoch 3/50 293/293 [==============================] - 17s 57ms/step - loss: 0.4041 - accuracy: 0.8446 - val_loss: 0.5275 - val_accuracy: 0.7920
...
Epoch 49/50 293/293 [==============================] - 17s 57ms/step - loss: 0.0640 - accuracy: 0.9780 - val_loss: 2.0328 - val_accuracy: 0.7338
Epoch 50/50 293/293 [==============================] - 17s 56ms/step - loss: 0.0415 - accuracy: 0.9845 - val_loss: 2.3481 - val_accuracy: 0.7386
# Evaluate the best model on the test set
loss, LSTM_testscore = rnn_model.evaluate(X_test, y_test, verbose=0)
# Evaluate the best model on the training set
loss, LSTM_trainscore = rnn_model.evaluate(X_train, y_train, verbose=0)
accuracy_scores_train.append(LSTM_trainscore)
accuracy_scores_test.append(LSTM_testscore)
print("Train Accuracy Score: ", accuracy_scores_train)
print("Test Accuracy Score: ", accuracy_scores_test)
Train Accuracy Score: [0.9917824300528273, 0.9942427790788446, 0.9405118823051453]
Test Accuracy Score: [0.7535371976266545, 0.7470400728597449, 0.734337568283081]
def plot_training_hist(history):
"""Function to plot history for accuracy and loss"""
fig, ax = plt.subplots(1, 2, figsize=(10, 4))
# first plot
ax[0].plot(history.history["accuracy"])
ax[0].plot(history.history["val_accuracy"])
ax[0].set_title("Model Accuracy")
ax[0].set_xlabel("epoch")
ax[0].set_ylabel("accuracy")
ax[0].legend(["train", "validation"], loc="best")
# second plot
ax[1].plot(history.history["loss"])
ax[1].plot(history.history["val_loss"])
ax[1].set_title("Model Loss")
ax[1].set_xlabel("epoch")
ax[1].set_ylabel("loss")
ax[1].legend(["train", "validation"], loc="best")
plot_training_hist(history)
Observations
# Create the dictionary with the GloVe embeddings
glove_embeddings_index = {}
glove_file = open(
"glove/glove.6B.50d.txt"
) # The 50 in the file name is the embedding dimension, which here happens to equal max_len
for line in glove_file:
values = line.split()
word = values[0]
coefs = np.asarray(values[1:], dtype="float32")
glove_embeddings_index[word] = coefs
glove_file.close()
print("Found %s word vectors." % len(glove_embeddings_index))
Found 400000 word vectors.
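The value of these dense vectors is that related words sit close together, which is usually measured with cosine similarity. A minimal sketch of that idea (the 4-d vectors below are made-up stand-ins for real 50-d GloVe vectors, so the numbers are only illustrative):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative 4-d stand-ins for real 50-d GloVe vectors (values made up)
toy_embeddings = {
    "delayed":   np.array([0.9, 0.1, -0.3, 0.2]),
    "cancelled": np.array([0.8, 0.2, -0.2, 0.3]),
    "awesome":   np.array([-0.7, 0.9, 0.4, -0.1]),
}

# Related words should score higher than unrelated ones
sim_related = cosine_similarity(toy_embeddings["delayed"], toy_embeddings["cancelled"])
sim_unrelated = cosine_similarity(toy_embeddings["delayed"], toy_embeddings["awesome"])
print(sim_related > sim_unrelated)  # -> True for these toy vectors
```

With the real glove_embeddings_index loaded above, the same function can be applied to any pair of words in its 400,000-word vocabulary.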
# Create a word embedding matrix with one row per word in the word index.
# Words without a GloVe embedding are left as an all-zeros row.
embedding_matrix = np.zeros((len(word_index) + 1, 50)) # 50 = the GloVe vector dimension
for word, i in word_index.items():
embedding_vector = glove_embeddings_index.get(word)
if embedding_vector is not None:
embedding_matrix[i] = embedding_vector # words not found in GloVe keep their all-zeros row
# Show any element of the matrix to confirm content
embedding_matrix[22]
array([ 0.50120002, 0.052743 , 0.71052998, 0.46959001, 1.05519998,
0.023635 , -0.68181998, 0.18503 , 0.83736002, -0.055731 ,
0.37808999, 0.43691 , -0.10603 , -0.31305999, 0.060604 ,
-0.1005 , -1.15310001, 0.37011999, 1.07799995, -1.28260005,
0.83467001, -0.098129 , -0.85596001, 0.70467001, -0.012172 ,
-0.97125 , -0.18861 , -0.16795 , 0.74255002, 0.039095 ,
2.53259993, 0.75392002, 0.84202999, -0.12890001, 0.11043 ,
-0.39398 , -0.65667999, 0.0034273 , 0.04577 , -0.43445 ,
0.75432998, -0.27877 , -0.030205 , 0.55124998, -0.18464001,
-0.66623998, 0.13788 , 0.99896997, 0.24781001, 1.18610001])
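Before relying on this matrix, it is worth measuring how much of the tweet vocabulary GloVe actually covers, since every missing word stays an all-zeros row. A hedged sketch of that check (the sample_* dicts below are tiny made-up stand-ins for the real word_index and glove_embeddings_index built above):

```python
# Stand-ins for the real dictionaries built above (contents made up;
# airline handles like these are unlikely to be in GloVe's vocabulary)
sample_vocab = {"flight": 1, "delay": 2, "usairways": 3, "jetblue": 4}
sample_glove = {"flight": [0.1, 0.2], "delay": [0.3, 0.4]}

covered = [w for w in sample_vocab if w in sample_glove]
coverage = len(covered) / len(sample_vocab)
print(f"GloVe covers {coverage:.0%} of the vocabulary")  # -> 50% here
print("all-zeros rows for:", sorted(set(sample_vocab) - set(sample_glove)))
# -> ['jetblue', 'usairways']
```

Run against the real dictionaries, a low coverage number would mean many tweet-specific tokens (handles, hashtags, lemmatization artifacts like "flightled") contribute nothing to the frozen embedding layer.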
The embedding_matrix obtained above is used as the weights of an embedding layer in our neural network model.
We set the trainable parameter of this layer to False so that it is not updated during training.
embedding_layer = Embedding(
input_dim=len(word_index)
+ 1, # 1 is added because 0 is usually reserved for padding
output_dim=50, # dimension of the GloVe vectors (which happens to equal max_len here)
weights=[embedding_matrix],
input_length=max_len, # length of the input sequences
trainable=False,
)
# Define a Sequential model
glove_lstm_model = Sequential()
# Add the Embedding layer
glove_lstm_model.add(embedding_layer)
# Add the first Bidirectional LSTM layer
glove_lstm_model.add(Bidirectional(LSTM(150, return_sequences=True)))
# Add a dropout ratio of 0.4
glove_lstm_model.add(Dropout(0.4))
# Add the second Bidirectional LSTM layer
glove_lstm_model.add(Bidirectional(LSTM(150)))
# Add a dropout ratio of 0.4
glove_lstm_model.add(Dropout(0.4))
# Add a Dense layer with ReLU activation
glove_lstm_model.add(Dense(128, activation="relu"))
# Add a dropout ratio of 0.4
glove_lstm_model.add(Dropout(0.4))
# Add the final Dense layer with softmax activation
glove_lstm_model.add(Dense(3, activation="softmax"))
# Compile the model
adam = Adam(learning_rate=0.0001) # Reduce the learning rate in an attempt to aid model convergence
glove_lstm_model.compile(loss='categorical_crossentropy',optimizer=adam,metrics=['accuracy'])
#####
# Train the model
#####
# Setup callbacks
callbacks = [
EarlyStopping(
monitor="loss", mode="min", verbose=1, patience=10
), # stop the training process once the model stops improving.
ModelCheckpoint(
filepath="model_weights2.h5", save_best_only=True, monitor="loss", mode="min"
), # for saving the best model during training
]
num_epochs = 50 # Number of epochs
# Train model
history = glove_lstm_model.fit(
X_train,
y_train,
epochs=num_epochs,
validation_split=0.2,
callbacks=callbacks,
verbose=1,
)
Epoch 1/50 293/293 [==============================] - 52s 165ms/step - loss: 0.7362 - accuracy: 0.6966 - val_loss: 0.6365 - val_accuracy: 0.7403
Epoch 2/50 293/293 [==============================] - 48s 162ms/step - loss: 0.6208 - accuracy: 0.7506 - val_loss: 0.6164 - val_accuracy: 0.7420
Epoch 3/50 293/293 [==============================] - 49s 167ms/step - loss: 0.6020 - accuracy: 0.7579 - val_loss: 0.6126 - val_accuracy: 0.7488
...
Epoch 49/50 293/293 [==============================] - 46s 157ms/step - loss: 0.2080 - accuracy: 0.9198 - val_loss: 0.8959 - val_accuracy: 0.7591
Epoch 50/50 293/293 [==============================] - 45s 154ms/step - loss: 0.2016 - accuracy: 0.9239 - val_loss: 0.9788 - val_accuracy: 0.7548
# Evaluate the best model on the test set
loss, GloVe_LSTM_testscore = glove_lstm_model.evaluate(X_test, y_test, verbose=0)
# Evaluate the best model on the training set
loss, GloVe_LSTM_trainscore = glove_lstm_model.evaluate(X_train, y_train, verbose=0)
accuracy_scores_train.append(GloVe_LSTM_trainscore)
accuracy_scores_test.append(GloVe_LSTM_testscore)
print("Train Accuracy Score: ", accuracy_scores_train)
print("Test Accuracy Score: ", accuracy_scores_test)
Train Accuracy Score: [0.9917824300528273, 0.9942427790788446, 0.9405118823051453, 0.8915518522262573]
Test Accuracy Score: [0.7535371976266545, 0.7470400728597449, 0.734337568283081, 0.7665182948112488]
plot_training_hist(history)
Observations
# Update the first two rows of the DataFrame with accuracy scores
accuracy_df.loc[0] = accuracy_scores_train
accuracy_df.loc[1] = accuracy_scores_test
# Give the index a descriptive text
index = ["Accuracy - Train Set", "Accuracy - Test Set"]
accuracy_df.index = index
# Display the updated DataFrame
accuracy_df
|  | Random Forest with BoW | Random Forest with TF-IDF | LSTM with Keras Tokenizer | LSTM with GloVe embedding |
|---|---|---|---|---|
| Accuracy - Train Set | 0.991782 | 0.994243 | 0.940512 | 0.891552 |
| Accuracy - Test Set | 0.753537 | 0.747040 | 0.734338 | 0.766518 |
Random Forest with BoW and LSTM with GloVe embedding have the best accuracy scores on the test set - 75.35% and 76.65% respectively. We choose LSTM with GloVe embedding as our final model because it has the highest accuracy score (76.65%) on unseen data and shows less overfitting.
Our analysis shows that 99.73% of airline_sentiment_gold is missing. This is the gold-standard sentiment label, i.e. the ground-truth label. The business should invest in the gold-standard sentiment label: a carefully curated and annotated label where human annotators have manually assigned sentiment labels (such as positive, negative, or neutral) to tweet samples. These labels are considered reliable and accurate, and they serve as a benchmark for training and evaluating machine learning models designed to automatically classify sentiment in text data with very high accuracy on unseen data. This would have helped both our LSTM with Keras Tokenizer and LSTM with GloVe embedding models converge much better.
Customer Service Issue dominated the negativereason column. The airlines should invest in customer service training and infrastructure, such as deploying AI and NLP in the call center, for speedy and efficient resolution of issues.
Only 16% of the tweets are positive. The airlines can encourage and reward customers for sharing positive experiences, as humans mostly share what they are unhappy about.
Airlines should encourage retweets of positive sentiment. Retweets (and shares) are perhaps the most important feature to track in social media sentiment analysis because they measure how impactful the sentiment is, how much users relate to and agree with it, and how far it will spread among users.
Geospatial analysis can be done to drill down on airline sentiment by location. Location data (e.g., 'tweet_location' or 'user_timezone') can be used to create geospatial visualizations to understand where tweets originate, providing local intelligence.
Our analysis shows that Southwest has the most retweets of positive sentiment, followed by Virgin America. The social media strategies of these two organizations should be studied and analyzed.
It's interesting to see "jetblue" as a more dominant word than many of the airlines represented in the airlines column, considering that jetblue is not a part of our analysis. Should JetBlue have been included in the analysis? Are there relationships that need to be unearthed? Is more data engineering and preparation needed? These questions need answers; further analysis is needed.
More data, preferably "gold sentiment" ground-truth labelled data, is needed to improve the accuracy and convergence of our models.